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Simulation  of  Longitudinal  Exposure  Data  with 
Variance-Covariance  Structures  Based  on  Mixed  Models 

Peng  Song,1  Jianping  Xue,2  *  and  Zhilin  Li3 


Longitudinal  data  are  important  in  exposure  and  risk  assessments,  especially  for  pollutants 
with  long  half-lives  in  the  human  body  and  where  chronic  exposures  to  current  levels  in  the 
environment  raise  concerns  for  human  health  effects.  It  is  usually  difficult  and  expensive  to 
obtain  large  longitudinal  data  sets  for  human  exposure  studies.  This  article  reports  a  new 
simulation  method  to  generate  longitudinal  data  with  flexible  numbers  of  subjects  and  days. 
Mixed  models  are  used  to  describe  the  variance-covariance  structures  of  input  longitudinal 
data.  Based  on  estimated  model  parameters,  simulation  data  are  generated  with  similar  sta¬ 
tistical  characteristics  compared  to  the  input  data.  Three  criteria  are  used  to  determine  sim¬ 
ilarity:  the  overall  mean  and  standard  deviation,  the  variance  components  percentages,  and 
the  average  autocorrelation  coefficients.  Upon  the  discussion  of  mixed  models,  a  simulation 
procedure  is  produced  and  numerical  results  are  shown  through  one  human  exposure  study. 
Simulations  of  three  sets  of  exposure  data  successfully  meet  above  criteria.  In  particular,  sim¬ 
ulations  can  always  retain  correct  weights  of  inter-  and  intrasubject  variances  as  in  the  input 
data.  Autocorrelations  are  also  well  followed.  Compared  with  other  simulation  algorithms, 
this  new  method  stores  more  information  about  the  input  overall  distribution  so  as  to  sat¬ 
isfy  the  above  multiple  criteria  for  statistical  targets.  In  addition,  it  generates  values  from 
numerous  data  sources  and  simulates  continuous  observed  variables  better  than  current  data 
methods.  This  new  method  also  provides  flexible  options  in  both  modeling  and  simulation 
procedures  according  to  various  user  requirements. 


KEY  WORDS:  Autocorrelation:  longitudinal  data;  mixed  models;  simulation;  variance-covariance 
structure 


1.  INTRODUCTION 

Longitudinal  data  on  the  intensity  of  time- 
varying  sources  of  exposures  are  extremely  im¬ 
portant  for  epidemiological  studies,  environmental 
exposure  modeling,  and  risk  assessment  because 
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they  possess  variance  and  covariance  structures 
that  cross-sectional  data  lack.  However,  it  is  very 
difficult  and  expensive  to  track  human  activities, 
environmental  measurements,  and  other  infor¬ 
mation  for  the  same  subject  through  extended 
periods  of  time  to  obtain  observed  longitudinal  data. 
Some  studies  do  provide  longitudinal  data  but  with 
small  sample  size  and  short  duration,  such  as  the 
Harvard  Southern  California  Ozone  Exposure 
Study/1!  PM2.5  Panel  Studies/2!  and  the  Detroit 
Exposure  and  Aerosol  Research  Study. (3)  This  lack 
of  data  restricts  the  applicability  of  observed  longitu¬ 
dinal  data  for  studies  of  broader  spatial  or  temporal 
scope. 

Computer  simulations  could  help  overcome  the 
above  limitation.  In  this  article,  we  first  build 
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statistical  models  for  available  observed  longitudi¬ 
nal  data,  and  then  develop  an  algorithm  to  gener¬ 
ate  large  amounts  of  longitudinal  data  with  flexible 
numbers  of  exposure  days  and  close  overall  distribu¬ 
tion  with  the  observed  data.  Three  criteria  are  used 
to  evaluate  this  simulation  method.  First,  the  overall 
mean  and  standard  deviation  ( SD )  should  be  close 
to  those  from  observed  data.  Second,  the  variances 
because  of  factors  such  as  inter  and  intrasubject,  sea¬ 
sonal,  and  other  characteristics  of  the  simulated  data 
should  be  similar  to  the  corresponding  variances  of 
the  observed  data.  Third,  autocorrelations  should  be 
consistent  between  the  simulated  and  observed  data. 
When  these  three  criteria  are  met,  the  generated  sim¬ 
ulation  data  can  avoid  misclassification  of  variance 
components  and  be  used  for  various  models,  such 
as  SFIEDS-Multimedia  (Stochastic  Human  Exposure 
and  Dose  Simulation  Model  for  Multimedia,  Multi¬ 
pathway  Pollutants). (4_8) 

The  challenge  of  the  above  approach  lies  in  how 
to  model  and  simulate  the  complicated  variance- 
covariance  structures  of  observed  input  data.  In  lon¬ 
gitudinal  data,  there  is  variation  among  subjects. 
Moreover,  there  are  both  variation  and  correlation 
within  each  subject.  Traditional  methods  usually  gen¬ 
erate  all  data  independently,  so  that  the  correlations 
within  subjects  are  ignored.  In  statistics,  mixed  mod¬ 
els  are  developed  to  make  this  issue  clear.  In  mixed 
models,  total  data  variance  is  divided  into  that  be¬ 
tween  subjects  (intersubject)  and  that  within  subjects 
(intrasubject).  Then,  we  can  model  several  types  of 
correlations  within  each  subject  as  necessary,  to  ac¬ 
curately  simulate  the  variance-covariance  structures 
in  the  observed  data.  In  this  article,  we  will  present 
the  use  of  mixed  methods,  and  the  simulation  pro¬ 
cedure  based  on  them.  Then  we  evaluate  this  new 
method  with  some  observed  exposure  data  and  pro¬ 
vide  simulation  results.  We  also  compare  this  new 
method  with  other  simulation  methods  to  demon¬ 
strate  its  features. 

2.  MIXED  MODELS  AND  SIMULATION 
METHODS 

2.1.  Basic  Mixed  Models 


including  variation  among  subjects,  variation  over 
time,  and  variation  because  of  measurement  error.  In 
this  model,  random  terms  are  assumed  independent 
and  normally  distributed  among  and  within  subjects. 
The  normal  distribution  issue  will  be  discussed  later 
in  this  section. 

As  a  modified  version  of  Equation  (1),  mixed 
model  discriminates  intersubject  and  intrasubject 
variances,  by  splitting  e,y  into  two  terms: 

y-ij  =  /x  +  £>;  +  elj,bl  ~  N(0,crl),etj  ~  A(0,cr;) ,  (2) 


where  b,  is  the  random  effect  of  subject  /,  and  e,y  is 
the  random  term  for  other  variation  in  its  yth  ob¬ 
servation.  Here,  b, -s  are  assumed  to  be  independent 
among  subjects,  and  eys  are  assumed  to  be  indepen¬ 
dent  among  and  within  subjects.  In  this  way,  observa¬ 
tions  from  subject  i  share  a  common  term  b,  to  retain 
their  correlation,  and  meanwhile  possess  their  own 
random  terms  e,ys  to  quantify  intrasubject  variability. 
Take  an  example  in  which  every  subject  has  obser¬ 
vations  for  four  consecutive  days.  Under  the  assump¬ 
tions  in  Equation  (2),  the  variance-covariance  matrix 
of  one’s  four-day  series  (4x1  random  vector)  is: 


S4x4  =  Var 
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where 


Traditional  simulation  methods  for  longitudinal 
data  are  usually  based  on  the  following  model  under 
assumption  of  independence: 

y,y  =  /x  +  Sjj , Eij  ~  N(0,  a2),  (1) 

where  y,y  is  the  /th  observation  of  the  /th  subject;  /x  is 
the  mean  of  all  observations;  e,y  is  the  random  term 


(°6+W2)  (3a) 

is  the  intraclass  correlation  coefficient  (ICC)  in  statis¬ 
tics*9^  and  exposure  studies. (10) 

The  SAS  procedures  PROC  MIXED  or  PROC 
GLM  can  provide  estimates  of  the  parameters  /x, 
at2,  ae2  in  model  (2).*n*  To  simulate  an  rc-day  series 
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for  subject  i,  bj  is  generated  first  for  the  subject.  Af¬ 
ter  that,  e,i,  ■  ■  e,„  are  generated  independently  and 

added  to  b,  for  day  1  through  day  n.  This  two-stage 
simulation  procedure  maintains  both  variation  and 
correlation  within  each  subject,  as  well  as  the  varia¬ 
tion  among  subjects.  Finally,  we  can  add  the  constant 
p  to  every  b,  +  e,y  to  meet  the  overall  mean  and  ob¬ 
tain  the  simulation  longitudinal  data  set. 


2.2.  Mixed  Models  with  Autocorrelation 


In  the  basic  mixed  model  (2)  and  its  variance- 
covariance  matrix  (3),  random  terms  et\ ,  . . .,  ein  are 
assumed  to  be  independent.  This  means  any  two  sets 
of  days  from  one  subject  have  equal  ICC  of  at,2 1 
( at, 2  +  oe2).  However,  sometimes  this  is  not  true. 
For  example,  some  high-level  exposures  tend  to  oc¬ 
cur  consecutively.  If  so,  data  from  two  closer  days 
are  likely  to  have  higher  correlations.  This  property 
is  called  autocorrelation  in  longitudinal  data.  As  a  re¬ 
sult,  random  term  e,y  in  model  (2)  is  no  longer  entirely 
independent  from  the  others.  Instead,  it  partially  de¬ 
pends  on  its  preceding  term  e,-j_  i .  For  this,  we  fur¬ 
ther  separate  e,y  into  two  terms:  one  is  determined  by 
e,  ,_  | ,  and  the  other  is  independent  from  all: 

eij  =  peij-i  +  Sij,  (4a) 


where  p  is  the  autocorrelation  coefficient  between 
two  consecutive  days,  or  lag-one  autocorrelation, 
—  1  <  p  <  1.  Then  model  (2)  is  modified  to  the 
following: 

yij  =  p  +  bi+  peij- 1  +  Sij,Sij~N( 0,  (l-p2)cr2),  (4) 


where  .v(/  is  independent  from  all.  Its  rescaled  vari¬ 
ance  (1  -  p2)  a  2  is  to  keep  total  variance  within 
one  subject  equal  to  a2.  Under  this  assumption,  the 
variance-covariance  matrix  becomes/11) 

(  bj  +  en  \ 
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(5) 

Among  the  two  matrices  in  Equation  (5),  the  first 
one  defines  intersubject  variances  equivalently  as  in 


Equation  (3),  and  the  second  one  allows  observations 
with  a  /v-tlay  lag  to  have  a  correlation  coefficient  of 
pk.  This  is  closer  to  reality  in  some  cases,  compared 
with  covariance  matrix  (3),  which  requires  observa¬ 
tions  from  any  two  days  to  have  equal  correlations. 

For  our  simulations,  we  need  to  estimate  pa¬ 
rameter  p  in  Equation  (4),  besides  ct;,2,  a  2  in  (2). 
For  input  data  over  a  short  time  period,  SAS  PROC 
MIXED  can  estimate  all  parameters  directly  by  the 
maximum  likelihood  method.  For  input  data  over 
long  time  periods,  say,  more  than  10  days,  that  al¬ 
gorithm  can  fail  because  of  large  computation  loads. 
Instead,  we  can  use  PROC  GLM  first  to  get  esti¬ 
mates  of  /x,  op2,  oe2,  and  save  all  residuals  e,y.  PROC 
ARIMA  can  be  used  on  residuals  en,...,  ein  to  calcu¬ 
late  the  autocorrelation  in  subject  i.  The  average  of 
all  subjects’  autocorrelation  coefficients  is  a  reason¬ 
able  estimate  for  p  in  the  population. 

2.3.  Mixed  Models  with  Classification  Variables 

The  mixed  models  discussed  above  are  for  lon¬ 
gitudinal  data  with  only  subject  and  time  informa¬ 
tion.  Oftentimes,  some  classifications  for  subjects  and 
observation  periods  are  desired.  For  example,  sub¬ 
jects  can  be  classified  by  groups  according  to  gen¬ 
der,  age,  or  living  district.  Similarly,  observation  days 
can  be  classified  by  treatments  applied  in  various 
month,  season,  or  different  categories.  It  could  be 
important  to  study  how  much  variance  is  attributed 
to  these  classification  effects.  In  mixed  models,  these 
classifications  are  modeled  as  fixed  effects,  distinct 
from  random  effects  such  as  bj .  Suppose  a  subject 
k  belongs  to  the  group  i,  and  its  /th  observation  is 
taken  under  treatment  j  (for  example,  the  yth  sea¬ 
son).  Then,  this  observation,  labeled  as  y^i,  can  be 
modeled  as  following: 

yijkl  =  p.  Ctj  fi  j  -}-  (u/i  );y 

T  bb(jj  pejjk.i- 1  T  Sjjbi ,  (6a) 

or  equivalently, 

yijlcl  =  1 4 ij  T  bk(i)  +  pejjk.l-l  T  Sijkl,  (6) 

where  /x,y  —  p  +  aj  +  fij  +  (uP)jj  is  the  mean  of 
all  data  from  group  i  and  under  treatment  j;  b^p 
is  the  random  effect  of  the  subject  k  from  group  i\ 
pejjk.i-i  +  Sjjki  is  the  random  term  of  that  observa¬ 
tion.  This  model  is  referred  as  the  split-plot  model  in 
statistics/12) 

Estimates  of  pLj  can  be  obtained  using  SAS 
PROC  MIXED  or  PROC  GLM.  For  our  simulations, 
we  first  build  variance  structures  +  pe^j-i  +  Sjjki 
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as  described  in  the  above  sections.  Then,  we  assign 
classification  levels  (for  example,  four  seasons,  two 
genders)  according  to  their  frequency  percentages  in 
the  input  data.  Finally,  we  add  up  the  corresponding 
/iy  by  assigned  classification  levels  to  build  the  fixed 
effects.  If  the  input  data  set  contains  classification 
structures  not  as  typical  as  in  model  (6),  we  can  al¬ 
ways  convert  it  to  satisfy  model  (6).  When  there  is  no 
group  classification  in  input  data,  we  can  simply  es¬ 
tablish  a  dummy  variable  (all  zero  values)  as  a  virtual 
group  effect  (a,  =  0).  If  there  are  two  treatment  ef¬ 
fects,  we  can  incorporate  their  information  into  one 
virtual  effect  to  fit  model  (6).  For  example,  suppose 
we  are  not  only  interested  in  which  season  the  ob¬ 
servations  belong  to  (four  levels  of  treatment  effect), 
but  also  interested  in  whether  they  are  observed  on 
a  weekday  or  weekend  (two  levels  of  treatment  ef¬ 
fect).  We  can  combine  both  of  them  into  one  virtual 
treatment  effect  with  2x4  =  8  levels,  and  then  send 
it  to  the  simulation  module  as  fij  in  model  (6).  After 
simulation  data  are  generated,  we  can  separate  this 
virtual  effect  back  to  the  original  season  effect  and 
weekday/weekend  effect  according  to  the  combining 
rules.  By  repeating  this  combine-and-separate  pro¬ 
cedure,  we  can  even  handle  more  complicated  clas¬ 
sification  information  in  input  data.  This  technique 
will  considerably  broaden  the  applicability  of  model 
(6),  without  any  modification  on  the  core  simulation 
program. 

2.4.  Transforms  on  Input  Data 

In  the  mixed  models  above,  all  random  terms 
bj  and  e,;,  are  assumed  normally  distributed.  This  is 
the  basic  assumption  of  mixed  model  theory,  as  well 
as  the  requirement  of  the  SAS  procedures  we  used. 
That  means  input  data  yLj  or  y;//c /  also  roughly  fol¬ 
low  normal  distributions.  If  they  are  not  normally 
distributed,  we  can  apply  some  mathematical  trans¬ 
forms  on  input  variables  y,y  or  yjjki  to  rescale  them. 
For  example,  some  exposure  data  empirically  follow 
a  log-normal  distribution.  Thus,  we  can  take  loga¬ 
rithms  of  input  data,  and  then  build  model  (6)  on  the 
transformed  data: 

log  (yijki)  =  P  +  a,-  +  +  bk{i) 

+  Peijk,l- 1  +  Sykl-  (7) 

However,  sometimes  we  find  that  after  the  log¬ 
arithm  transform,  residuals  bk(i)  or  e^i  still  fail  to  fit 
a  normal  distribution.  That  may  cause  some  biased 
simulation  results,  such  as  a  higher  or  lower  overall 
SD  than  the  target  level  of  input  data.  One  alterna¬ 
tive  is  to  use  a  broader  family  of  transforms,  called 


the  Box-Cox  transforms, (131  of  which  the  logarithm 
transform  is  only  one  special  case: 


j(/-l)A  if  ^0 

|  log(y)  if  1  =  0. 


(8) 


When  1  approaches  to  zero,  the  Box-Cox  trans¬ 
form  is  almost  the  logarithm  transform,  because 
lim^o  =  log(y).  When  1  =  1,  the  Box-Cox 
transform  just  shifts  all  data  down  by  one  unit,  with¬ 
out  changing  the  variance-covariance  structures  of 
input  data.  With  Equation  (8)  implemented,  model 
(7)  can  be  generalized  to  the  following: 

/1\ 

yijki  =  M  +  a;  +  A+M)y  +  bm  +  Peijk,i-1  +  Sijki,  (9) 

The  optimal  Box-Cox  transform  can  be  automat¬ 
ically  completed  by  SAS  PROC  TRANSREG.  This 
module  goes  through  all  possible  1  values  (for  exam¬ 
ple,  from  -3  to  3  by  0.1)  and  evaluates  each  likeli¬ 
hood  for  model  (9).  Then,  the  1  value  with  maximum 
likelihood  is  selected.  In  this  way,  this  “smart”  Box- 
Cox  transform  can  determine  whether  the  input  data 
yijki  already  meet  the  normal  assumption  (k  =  1  se¬ 
lected),  or  whether  the  logarithm  should  be  taken 
(k  —  0  selected),  or  another  transform  (X  ^  0,  1  se¬ 
lected)  should  be  applied. 

Although  the  Box-Cox  transform  can  make  in¬ 
put  data  closer  to  a  normal  distribution,  there  are 
some  cases  it  cannot  help  with,  such  as  heavy  tails 
or  outliers  in  input  data.  In  these  cases,  transformed 
data  may  not  conform  to  the  assumptions  of  mixed 
models,  resulting  in  certain  bias  in  the  simulation. 


2.5.  Simulation  Procedure 

We  now  formulate  a  complete  procedure  to  sim¬ 
ulate  longitudinal  data  based  on  mixed  model  (9)  by 
SAS: 

Step  1:  Observed  input  data.  Compute  tar¬ 
get  statistics:  overall  mean  and  SD ,  vari¬ 
ance  components  percentages,  average  lag- 
one  autocorrelation  coefficient. 

Step  2:  If  input  data  have  more  classifica¬ 
tions  than  model  (6),  combine  them  into  one 
group  effect  and  one  treatment  effect  as  in 
model  (6),  and  then  keep  the  percentages  of 
all  levels  of  groups  and  treatments. 

Step  3:  Find  the  optimal  Box-Cox  transform 
as  in  Equation  (9)  by  SAS  PROC  TRAN¬ 
SREG. 

Step  4:  Estimate  all  model  parameters  in 
Equation  (9)  by  SAS  PROC  GLM  and 
PROC  MIXED. 
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Fig-1-  Flow  chart  of  simulation  of  longitudinal  data  by  the  complete  mixed  model  (9).  The  left  side  (J)-@  include  modeling  steps,  and  the 
right  side  (J)-®  include  simulation  steps. 


Step  5:  Input  required  numbers  of  subjects 
and  days  in  simulation  data,  generate  that 
amount  of  bk(i),  peijkj-i,  Sijki  from  corre¬ 
sponding  normal  distributions,  organize  and 
add  up  as  in  Equations  (4)  and  (9). 

Step  6:  Assign  group  and  treatment  levels  as 
their  percentages  in  Step  2.  Add  up  proper 
means  /x,;/  =  p  +  a,  +  fij  +  {aP)ij  for  data  in 
each  level. 

Step  7:  Transform  obtained  simulation  data 
ytjkPA)  back  to  original  scale  y^i,  as  inverse 
of  Step  3. 

Step  8:  Restore  group  and  treatment  effects  as 
in  input  data,  as  inverse  of  Step  2. 

Step  9:  Check  simulation  results.  Compute 
three  aspects  of  statistics  from  simulation 
data  and  compare  with  the  targets  set  up  in 
Step  1. 

Fig.  1  is  a  flow  chart  of  this  procedure  from  Step 
1-Step  9. 

3.  RESULTS 

The  PM2.5  (particulate  matter  less  than  2.5  pm 
in  diameter)  Panel  Studies*2'  took  observations  on 
personal,  indoor,  and  outdoor  PM  exposure  data  and 
other  variables  of  interest  from  37  participants  over 


four  seasons  from  June  2000  to  June  2001  (one  exam¬ 
ple  can  be  seen  in  Appendix  Fig.  Al).  The  involved 
subjects  came  from  two  socioeconomic  cohorts  liv¬ 
ing  in  Raleigh  and  Chapel  Hill,  respectively,  both 
in  North  Carolina.  Each  subject  was  expected  to  be 
monitored  on  seven  consecutive  days  in  each  season. 
Because  of  missing  data,  there  are  23  observations 
per  subject  on  average. 

We  conducted  simulation  experiments  on  three 
input  variables:  personal  PM,  indoor  PM,  and  out¬ 
door  PM.  We  present  results  of  outdoor  PM  as  an 
example  to  test  four  models  as  described  above  in 
a  tiered  order:  the  independence  model  (1),  the  ba¬ 
sic  mixed  model  (2),  the  mixed  model  with  lag-one 
autocorrelation  (AR[lj)  as  in  Equation  (4),  and  the 
complete  model  using  Box-Cox  transformed  data  as 
in  Equation  (9).  For  a  better  comparison  with  Equa¬ 
tion  (9),  we  took  logarithms  on  input  data  for  the  first 
three  models. 


1  =  P'ij  +  Sijki 

(L0a) 

=  P'ij  +  &k(i)  “1“  Sijki 

(10b) 

=  P'ij  bk(i)  “f“  P^ijk,l— 1  “1"  $ijkl 

(10c) 

yijkl  =  P'ij  b k(i )  “1“  Peijk,l—1  $ijkl 

(lOd) 
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Table  I.  Simulation  Results  of  Outdoor  PM  by  Four  Models  (10a)-(10d)  with  1,000  Subjects  Over  300  Days,  Compared  with  Target 

Statistics  Set  by  Actual  Observed  Data 


Method 

Overall  Scale 
(Mg/m3) 

Variance  Components  Percentages  (%) 

AR(1)  p 

Mean 

SD 

Inter 

Intra 

Cohort 

Season 

Observed 

20.0 

9.5 

13.5 

75.9 

0.5 

8.9 

0.39 

(10a) Independent 

20.2 

11.0 

0.3 

89.7 

0.4 

9.6 

0.10 

(10b)  Mixed 

20.3 

11.1 

8.1 

82.4 

0.5 

9.0 

0.10 

(10c)  Mixed  +  AR(1) 

20.2 

11.0 

8.1 

82.2 

0.3 

9.4 

0.37 

(10d)  Mixed  +  AR(1)  +  Box 

20.0 

9.5 

10.2 

79.5 

0.4 

9.9 

0.40 

Note:  Shaded  vertical  comparisons  show  how  simulation  results  are  improved  by  each  model  refinement.  In  row  (10b),  variance  components 
inter-  and  intrapercentages  are  corrected  when  mixed  model  is  used.  In  row  (10c),  observed  autocorrelation  coefficient  is  approached  when 
AR(1)  is  added  to  model.  In  row  (lOd),  overall  standard  deviation  ( SD )  is  adjusted  when  Box-Cox  transform  is  used  to  replace  logarithm 
transform. 


In  Table  I,  the  first  row  in  bold  labeled  “Ob¬ 
served”  shows  target  statistics  associated  with  the  in¬ 
put  data.  Below  that,  the  simulation  results  from  the 
four  models  are  presented  in  order.  The  shaded  ver¬ 
tical  comparisons  show  how  certain  target  statistics 
are  improved  by  upgrading  the  above  model  to  the 
one  below.  From  the  independence  model,  we  see 
that  the  intersubject  variance  percentage  is  almost  di¬ 
minished  to  zero,  whereas  the  intrasubject  variance 
percentage  is  much  higher  than  its  target.  When  the 
basic  mixed  model  is  used,  these  two  variance  com¬ 
ponents  immediately  get  closer  to  their  target  per¬ 
centages.  These  two  are  discussed  more  at  the  end  of 
this  section.  When  the  mixed  model  is  further  modi¬ 
fied  with  autocorrelation,  we  see  that  the  autocorre¬ 
lation  coefficient  is  raised  to  0.37,  comparable  to  the 
observed  value  of  0.39.  Finally,  when  we  improve  the 
logarithm  transform  to  the  optimal  Box-Cox  trans¬ 
form,  the  simulation  overall  SD  is  slightly  adjusted 
from  11.0  to  9.5. 

Parallel  simulations  for  personal  PM  and  indoor 
PM  gave  similar  results  as  in  Table  I:  simulations  by 
the  last  model  “Mixed  +  AR(1)  +  Box”  work  well 
to  approach  observed  targets  in  all  aspects.  When 
fixed  effects  such  as  cohort  and  season  take  very  few 
percentages  in  observed  data  (less  than  1%),  they 
can  be  considered  insignificant  and  omitted  in  the 
simulation. 

We  also  studied  the  trend  of  autocorrelation  for 
increasing  simulation  time  periods  in  Fig.  2.  The  ob¬ 
served  outdoor  PM  data  contain  much  larger  auto¬ 
correlation  (p  =  0.40)  than  the  observed  personal  PM 
data  (p  =  0.08).  From  Fig.  2,  both  simulation  auto¬ 
correlations  start  from  a  low  level,  increase  gradually 
with  time,  and  eventually  become  stable.  For  outdoor 
PM,  the  simulation  autocorrelation  converges  to  its 
observed  level  accurately.  For  personal  PM,  the  sim- 
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Fig.  2.  Simulation  on  autocorrelation  of  personal  PM  (labeled 
simu.pers)  and  outdoor  PM  (labeled  simu.out)  for  increasing 
time  lengths  with  1,000  subjects,  compared  with  their  observed 
autocorrelations. 


ulation  autocorrelation  is  a  little  higher  than  its  ob¬ 
served  value,  which  is  in  fact  very  weak.  If  the  ob¬ 
served  autocorrelation  can  be  verified  insignificant, 
we  can  simply  force  p  to  be  zero  and  apply  the  model 
without  autocorrelation. 

In  Fig.  3,  we  ran  the  simulation  on  personal 
PM  for  increasing  days  to  explore  the  trends  of 
intersubject  and  intrasubject  variance  percentages. 
Simulation  results  of  the  independence  model  (1) 
are  also  shown  for  comparison.  The  independence 
model  reduces  intersubject  variance  toward  zero  as 
the  simulation  period  is  lengthened,  whereas  the 
mixed  model  maintains  intersubject  variance  close  to 
its  observed  target.  Accordingly,  the  target  intrasub¬ 
ject  variance  is  also  well  approached  by  the  mixed 
model.  Another  observation  is  that  simulation  inter¬ 
subject  percentage  tends  to  decrease  and  then  get 
stabilized,  whereas  intrasubject  percentage  goes  in 
the  opposite  direction.  This  is  because  inter-  and  in- 
trasubject  variances  percentages  are  also  affected  by 
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Fig.  3.  Simulation  of  personal  PM  inter-  and  intrasubject  variances 
percentages  with  1,000  subjects  over  increasing  time  lengths  by 
mixed  model  (labeled  “simu_”)  and  independence  model  (labeled 
“mO_”),  compared  with  their  observed  target  percentages  (labeled 
“obs_”). 

data  length,  especially  in  short-term  data  (less  than 
30  days).  We  will  examine  this  issue  further  below. 

For  short-term  longitudinal  data,  Glen  et  al.<l()> 
proposed  a  relationship  between  true  (long-run) 
intersubject  variance  percentage  D ,  and  its  observed 
value  D*  from  a  sample  of  length  N:  D*  —  D  +  (1 
-  D)IN.  From  this  equation,  a  short-term  sample 
tends  to  give  an  overestimated  intersubject  vari¬ 
ability.  It  is  recommended  to  correct  the  observed 
intersubject  variance  percentage  by  D  =  (ND*  -  1)/ 
(N  -  1)  for  a  better  target,  instead  of  being  very 
sensitive  to  data  length.  The  intrasubject  variance 
percentage  will  be  corrected  accordingly  (shifted  up 
a  little  bit)  because  all  other  fixed  effects  such  as 
season  and  cohort  always  keep  their  percentages  in 
total  variance  when  data  are  lengthened.  In  Table 
II,  we  add  the  corrected  targets  in  parentheses  after 
observed  targets  for  all  three  variables  and  compare 
simulation  results  with  them.  When  simulations  are 
run  based  on  entire  observed  data,  the  inter-  and  in¬ 
traresults  successfully  approach  the  corrected  target 
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values.  Moreover,  it  is  interesting  to  test  how  the  sim¬ 
ulations  perform  if  fewer  sample  data  are  available. 
We  use  half  of  the  observed  data  (last  two  seasons,  13 
days)  to  run  our  simulations,  and  find  that  simulation 
results  on  inter-  and  intrasubject  variances  per¬ 
centages  have  larger  relative  errors  to  the  corrected 
targets.  Results  for  the  other  targets  are  similar  to  Ta¬ 
ble  I.  Simulations  based  on  even  shorter  samples  (less 
than  10  days)  are  not  recommended  because  large 
biases  in  the  inter-  and  intravariances  percentages 
would  appear.  These  results  also  agree  with  a  previ¬ 
ous  study, (14)  which  reported  that  at  least  10  days  of 
observations  are  needed  to  capture  a  reliable  ICC. 

4.  DISCUSSION 

This  article  reports  a  new  simulation  method  for 
longitudinal  data.  A  series  of  mixed  models  are  ap¬ 
plied  to  describe  variance-covariance  structures  of 
input  longitudinal  data.  As  outlined  in  Fig.  1,  input 
longitudinal  data  are  analyzed  first  to  estimate  model 
parameters,  and  then  these  parameters  are  used  to 
generate  output  longitudinal  data  with  a  distribution 
closely  following  the  input  data.  The  output  distri¬ 
bution  is  checked  from  aspects  of  overall  mean,  SD , 
variance  components  percentages,  and  autocorrela¬ 
tion.  Three  data  sets  from  the  PM2.5  Panel  Studies 
are  used  to  test  this  new  method.  Most  simulation 
experiments  yield  accurate  and  robust  results  in  ap¬ 
proaching  input  data  targets. 

Compared  with  other  simulation  methods,  this 
new  method  has  the  following  features. 

The  first  feature  is  the  crucial  role  of  mixed  mod¬ 
els.  Most  of  our  efforts  were  focused  on  model  refine¬ 
ment  to  describe  the  input  data  accurately  and  com¬ 
prehensively.  If  one  model  can  fit  input  longitudinal 
data  very  well,  simulations  using  that  model  should 


Table  II.  Simulation  Inter-  and  Intrasubject  Variances  Percentages  Compared  with  Observed  Targets  and  Corrected  Targets  (in 
Parentheses)  for  Three  Input  Variables;  Simulation  Relative  Errors  to  Corrected  Targets  are  Also  Provided  Below 


Data 

PM_Personal 

PMJndoor 

PM.Outdoor 

Inter % 

Intra% 

Inter% 

Intra% 

Inter% 

Intra% 

Entire  Obs.  (23d) 

26.1  (23.0) 

71.6  (74.7) 

29.9  (27.0) 

66.5  (69.3) 

13.5  (10.3) 

75.9  (79.1) 

Simulation  (300d) 

21.6 

75.8 

26.9 

69.6 

10.2 

79.5 

Relative  Error  (%) 

6.3 

1.5 

0.4 

0.4 

1.0 

0.5 

Half  Obs.  (13d) 

26.7(21.4) 

67.2(72.5) 

30.1  (25.2) 

62.6  (67.5) 

15.0  (9.6) 

67.6  (72.9) 

Simulation  (150d) 

16.6 

77.8 

26.5 

64.8 

9.3 

73.5 

Relative  Error  (%) 

25.3 

7.1 

5.0 

4.1 

3.2 

0.8 

Note:  The  top  results  are  from  a  300-day  simulation  using  the  entire  input  observations  (23  days);  the  bottom  results  are  from  a  150-day 
simulation  using  half  of  input  observations  (winter  and  spring,  13  days).  Each  simulation  includes  1,000  subjects. 
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produce  good  results.  From  this  point,  this  simula¬ 
tion  method  first  serves  as  a  data  modeler  and  then 
as  a  data  generator.  Through  the  modeling  process, 
this  method  stores  a  lot  of  sample  information,  so 
that  it  can  simultaneously  satisfy  several  targets  to 
closely  replicate  the  input  distribution.  In  contrast, 
other  simulation  methods  are  mostly  designed  from 
only  one  or  two  statistical  aspects.  There  are  advan¬ 
tages  and  disadvantages  of  other  methods  and  ours. 
Our  method  has  the  relatively  stricter  requirements 
on  the  input  data  imposed  by  mixed  models. 

The  second  feature  is  the  simulation  data  source. 
Current  simulation  methods  usually  sample  data 
from  available  data  pools  either  by  random  draw 
methods  or  more  sophisticated  drawing  algorithms 
such  as  Glen’s  method. (10)  That  means  simulation 
data  can  only  take  limited  values  existing  in  pools. 
When  the  simulation  size  is  much  larger  than  the  pool 
size,  sampling  methods  can  result  in  forced  repeti¬ 
tions  of  limited  available  values,  and  make  the  sim¬ 
ulation  data  appear  discrete.  Flowever,  observed  ex¬ 
posure  data  are  usually  continuously  distributed.  Our 
new  method  can  emulate  this  property  because  it 
starts  by  generating  random  numbers  from  the  stan¬ 
dard  normal  distribution,  which  is  actually  an  infinite 
data  pool.  In  long  simulations,  new  values  will  always 
be  produced  in  the  output  data  set,  so  that  the  simu¬ 
lation  data  will  behave  close  to  a  continuous  distribu¬ 
tion. 

The  third  feature  is  the  flexibility  in  simulation 
practice.  This  method  has  been  coded  in  SAS  Macro 
language.  The  user  only  needs  to  input  a  longitu¬ 
dinal  data  set,  and  specify  how  many  subjects  and 
days  to  generate.  Then,  both  modeling  and  simula¬ 
tion  steps  in  Fig.  1  run  automatically  until  a  simula¬ 
tion  data  set  is  output  and  a  comparison  table  like 
Table  I  is  displayed.  The  user  has  options  to  specify 
which  variable  to  simulate,  what  classification  factors 
to  involve,  whether  the  autocorrelation  is  considered 
or  not,  and  what  kind  of  transforms  to  apply  on  in¬ 
put  variables.  The  user  can  also  simulate  a  subset  of 
input  data  with  particular  properties  if  necessary. 

There  are  certain  requirements  to  apply  the  tech¬ 
nique  to  assemble  the  longitudinal  data  such  as  con¬ 
tinuous  measurements,  normal  distributions  required 
by  mixed  model,  and  estimated  variance-covariance 
structures.  It  is  important  to  take  co-occurrence  into 
consideration  when  the  technique  is  used  in  model¬ 
ing  cumulative  exposures  for  multiple  chemicals. 

There  are  possible  extensions  for  our  simula¬ 
tion  method  in  future  studies.  First,  some  general¬ 
ized  linear  mixed  model  tools  have  been  recently 
developed  for  response  variables  that  are  not  nor¬ 


mally  distributed.  Using  these,  we  could  potentially 
fit  the  observed  data  more  accurately  and  obtain  bet¬ 
ter  simulation  results.  Second,  continuous  variables 
could  be  added  into  current  models  as  covariates, 
such  as  air  exchange  rates  that  affect  indoor  pollu¬ 
tion  levels,  in  addition  to  the  classification  variables 
already  included.  Third,  related  input  variables  could 
be  simultaneously  modeled  as  a  group  to  maintain 
inherent  correlations  among  them.  These  extensions 
would  generalize  our  method  significantly  and  are 
well  within  current-day  practices. 

5.  CONCLUSIONS 

The  new  technique  presented  in  this  article  uses 
variance-covariance  structure  and  autocorrelation 
coefficients  from  limited  longitudinal  data  to  simu¬ 
late  unlimited  longitudinal  data.  Inter-  and  intraper¬ 
sonal  variances  and  autocorrelation  are  close  to  the 
observed  longitudinal  data. 

The  new  method  will  be  important  for  exposure 
models  such  as  EPA’s  SHEDS-Multimedia  because 
it  can  be  used  to  simulate  a  series  of  important  input 
variables  by  keeping  their  variance-covariance  struc¬ 
ture  with  more  accurate  prediction  of  the  variance 
and  high  percentiles  of  exposure  output  by  the  mod¬ 
els.  This  could  be  valuable  for  linking  environmental 
pollutants  with  chronic  adverse  health  effects. 
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APPENDIX  A 


We  provide  a  figure  about  daily  personal  PM2.5 
profiles  of  one  given  individual  from  observed  data 
(28  days)  and  simulated  data  (365  days). 

APPENDIX  B 

We  provide  some  SAS  codes  for  core  simulation 
steps  4-6  as  in  Fig.  1.  We  used  the  version  of  SAS  9.2 
TS  Level  1M0. 

Suppose  we  are  going  to  simulate  one  variable 
from  the  input  longitudinal  data  set.  We  run  the  fol¬ 
lowing  macro  to  estimate  the  model  parameters  as  in 
step  4. 

Function:  Estimate  key  model  parameters 
for  variance  and  means; 

Input  :  sample:  objective_sample 
y:  interested  variable; 

Output:  sigma_b,  sigma_e,  rho : 

parameters  defined  in  mixed 
model 

means:  data  set  to  keep  classification 
means ,  i . e . ,  means  of 
each  group*trt  classification; 
%macro  model_parameters  (sample,  y)  ; 

/(global  sigma_b  sigma_e  rho; 

title  ‘Estimate  Mean  and  ANOVA  parameters1; 

proc  mixed  data  =  fesample 


class  group  trt  subject; 
model  &y  =  group|trt  /  s; 
random  subj ect (group) ; 
lsmeans  group*trt; 

ods  output  covparms  =  sigma  lsmeans  = 
means ; 
run; 

data  sigma_2  (keep  =  sigma_b  sigma_e) ; 
array  a(2)  sigma_b  sigma_e; 
do  _N_  =  1  to  2; 
set  sigma; 

a(_N_)  =  estimate; 

end; 

run; 

data  _n_; 

set  sigma_2; 

call  symput  (‘sigmaJb’,  sigmaJb)  ; 
call  symput  (‘sigma_e’,  sigma_e) ; 
run; 

title  ‘Estimate  AR(1)  on  Residues’; 
proc  glm  data  =  /(sample 
class  group  trt  subject; 
model  &y  =  group  trt  subject  (group)  ; 
output  out  =  residual  r  =  residual 
p  =  predicted; 
run; 

*  Note:  in  model,  rho  is  defined  by 

correlation  of  random  errors,  so  below 
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rho  is  calculated  upon  residues,  instead  of 
raw  data; 

°LarA  (sample  =  residual,  y  =  residual, 
sub  =  subject); 

proc  print  data  =  arjnean;  run; 
data  _n_; 

set  arjnean; 

call  symput  ('rho’,  autocorrelation); 

run; 

“/.put  &sigma_b  &sigma_e  &  rho 

%mend; 

Then  we  can  run  the  following  macro  for  simulation 
steps  5  and  6.  We  need  to  input  n_sub  and  n_day  to 
specify  the  size  of  simulation  data  set.  We  also  input 
the  model  parameters  obtained  from  above.  The  out¬ 
put  is  the  simulation  data  set. 

Function:  Main  step  of  FLA  method,  generate 
simulation  data  set.  See 
more  details  step  by  step  below. 

Input:  n_sub,  n_day:  how  many  subjects  and 
days  to  be  simulated 
sigma_b,  sigma_e,  rho:  key  model 
parameters  to  build  variance- 
covariance  structure 
n_group ,  n_trt :  numbers  of  group 
levels  and 
treatment  levels 

p_group,  p_trt :  percentage  of  each 
group  level  and  each 
treatment  level 

means:  classification  means  of  each 
group*treatment  level; 

Output:  simulation:  simulation  data  set; 

%macro  r_a  (n_sub,  n_day,  sigma_b,  sigma_e, 
rho,  n_group,  n_trt,  p_group,  p_trt,  means); 

*  First,  build  basic  model  with  proper 
variance — covariance  structure; 
data  simulation; 

do  i  =  1  to  &n_sub; 

b  =  rannor (0)*sqrt (&sigma_b)  ; 
do  j  =  1  to  &n_day; 
subject  =  i; 
day  =  j; 
b  =  b; 

if  j  =  1  then 

do;  e  =  rannor (0) *sqrt (&sigma_e) ; 
s  =  0 ;  output ;  end ; 

else  do; 

s  =  rannor (0) *sqrt ( (l-&rho*&rho) *&sigma_e)  ; 
e  =  &rho*e  +  s ; 


output ; 
end; 
end; 
end; 
run; 

*  Second,  modify  into  complete  model  by 
assigning  classification  levels  and 
level  means ; 

*  Assign  group  number; 

°/0 assign  (sample  =  simulation, 
var_l  =  subject, 
n_l  =  &n_sub, 
var_2  =  group , 
n_2  =  &n_group, 
proportion  =  &p_group) ; 

*  Assign  treatment  number; 

°/0 assign  (sample  =  simulation, 
var_l  =  day, 
n_l  =  &n_day, 
var_2  =  trt, 
n_2  =  &n_trt , 
proportion  =  &p_trt) ; 

*  Distribute  classification  means  according 
to  assigned  levels; 

"/.do  i  =  l'/oto  &n_group ; 

7»do  j  =  l“/.to  &n_trt ; 
data  select; 
set  '/.means ; 

if  group  =  &i  and  trt  =  &j; 
keep  estimate; 
run; 

data  _Null_; 
set  select; 

call  symput  (‘mean1 , estimate) ; 
run; 

"/.put  “/.means ; 
data  simulation; 
set  simulation; 

if  group  =  &i  and  trt  =  &j  then  mu  = 
/.means ; 

run; 

“/.end ; 

“/.end ; 

*  Last,  add  up  above  terms  following  model; 
data  simulation; 

set  simulation; 
y  =  mu  +  b  +  e ; 
run; 

%mend; 
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